Part of the series: Local LLM inference in OCI Ampere A1 (8 posts)
- Benchmarking CPU-only LLM Inference with Optimization: Caching and Batching
- Benchmarking CPU-only LLM Inference: Prompt Variation
- Benchmark local LLM inference engines in Oracle Ampere
- Serving Plamo-2-Translate LLM for Japanese-English Translation on Oracle Ampere VM
- Convert and quantize LLM models with Ampere optimized llama.cpp container
- How to run llama.cpp on Arm-based Ampere with Oracle Linux
- Serve and inference with local LLMs via Ollama & Docker Model Runner in Oracle Ampere
- Running LLMs locally on Ampere A1 Linux VM: Comparing options
My last blog post examined optimization strategies such as caching and batching (sequential and concurrent) in a CPU-only environment, with code and results showing their observed effectiveness.
In this post, let’s compare different llama-server configurations to see if any settings can further improve caching and batching. We will also compare the new results of each test with our initial baseline (fixed short prompt, sequential runs): 94.52ms TTFT and 10.57 TPS.
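Throughout this post, TTFT and TPS are derived from request timestamps. Below is a minimal sketch of that arithmetic; the helper names are mine, not from the benchmark script:

```python
def ttft_ms(t_request: float, t_first_token: float) -> float:
    """Time To First Token: delay between sending the request and
    receiving the first generated token, in milliseconds."""
    return (t_first_token - t_request) * 1000.0

def decode_tps(n_tokens: int, t_first_token: float, t_done: float) -> float:
    """Tokens per second over the decode (generation) phase."""
    return n_tokens / (t_done - t_first_token)

# Synthetic example: first token arrives 94.52 ms after the request,
# then 50 tokens are generated at the baseline rate of 10.57 TPS.
print(round(ttft_ms(0.0, 0.09452), 2))                          # 94.52
print(round(decode_tps(50, 0.09452, 0.09452 + 50 / 10.57), 2))  # 10.57
```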
Methodology
Remember we ran our caching and batching benchmarks with these optimized flags:
```
./llama-server -m /path/to/qwen-2.5-3b-q4_k_m.gguf \
  -t 4 \
  --threads-batch 4 \
  -np 8 \
  --host 0.0.0.0 --port 8080
```

Key parameters:
- `-m`: model path.
- `-t 4`: the number of threads for token generation (decode). 4 = our Ampere A1 free-tier compute instance's core count.
- `--threads-batch 4`: the number of threads for prompt processing (pre-fill).
- `-np 8`: number of context slots for continuous batching. We'll start high (e.g., 8) to test the queuing limit.
Let's see how the same benchmarks fare if we run llama-server with all default settings, that is, without explicitly configuring threads for prompt evaluation and token generation.
Results:
--- Caching Test (LONG PROMPT) ---
1. Cold Run (Prefix Tokens: 2571)... TTFT: 172979.51ms
2. Warm Runs (Cache Reuse)...
Warm Run 1 TTFT: 1359.32ms
Warm Run 2 TTFT: 577.33ms
Warm Run 3 TTFT: 573.61ms
Summary: Cold TTFT: 172979.51ms | Avg Warm TTFT: 836.76ms
Cache Speedup: 99.5%
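The reported 99.5% cache speedup follows directly from the cold and warm TTFTs in the log above; a quick sanity check:

```python
cold_ttft_ms = 172979.51                  # cold-run TTFT from the log above
warm_ttft_ms = [1359.32, 577.33, 573.61]  # the three warm runs

avg_warm = sum(warm_ttft_ms) / len(warm_ttft_ms)
speedup_pct = (1 - avg_warm / cold_ttft_ms) * 100
print(f"Avg warm TTFT: {avg_warm:.2f} ms, cache speedup: {speedup_pct:.1f}%")
```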
--- Batching Benchmarks (Short Prompt, Max Tokens=50) ---
**1. Sequential Baseline (No Queuing/Contention):**
Running Sequential Batch (BS=1)... Done.
Seq Batch 1: Avg TTFT 1363.26ms | Avg Gen+Prefill TPS 7.63
Running Sequential Batch (BS=2)... Done.
Seq Batch 2: Avg TTFT 591.94ms | Avg Gen+Prefill TPS 10.44
Running Sequential Batch (BS=4)... Done.
Seq Batch 4: Avg TTFT 577.37ms | Avg Gen+Prefill TPS 10.02
**2. Concurrent Test (Measures Queuing Tax and Aggregate TPS):**
Conc Batch 1: Avg TTFT 597.54ms | **Aggregate TPS 10.39** | Success 100.0%
Conc Batch 2: Avg TTFT 1939.59ms | **Aggregate TPS 10.11** | Success 100.0%
Conc Batch 4: Avg TTFT 7757.64ms | **Aggregate TPS 9.70** | Success 100.0%
Conc Batch 8: Avg TTFT 13498.76ms | **Aggregate TPS 10.22** | Success 100.0%
-> Interpretation: High TTFT confirms **queuing latency** on the CPU is a major factor at this concurrency level.
Conc Batch 12: Avg TTFT 18354.88ms | **Aggregate TPS 10.59** | Success 100.0%
-> Interpretation: High TTFT confirms **queuing latency** on the CPU is a major factor at this concurrency level.

Caching
This test is a direct comparison between a sequential process (default flags) and a highly parallelized process (optimized flags) for the same heavy workload.
It isolates the impact of parallel computation on the prompt evaluation (pre-fill) phase, which is where the long-prompt speed bottleneck occurs. The effectiveness of parallelization shows in the dramatic difference in cold-run TTFT for the 2571-token prompt across the two tests:
| Metric | Optimized Flags (-t 4, etc.) | Default | Conclusion |
|---|---|---|---|
| Long Prompt TTFT (Cold) | 186.07 ms | 172,979.51 ms (2.88 minutes) | The default config is nearly 930x slower at initial context pre-fill. |
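The "nearly 930x" figure is just the ratio of the two cold-run TTFTs:

```python
optimized_cold_ms = 186.07    # optimized-flags run (previous post)
default_cold_ms = 172979.51   # default-config run above

ratio = default_cold_ms / optimized_cold_ms
print(f"Default is ~{ratio:.0f}x slower at cold pre-fill")  # ~930x
```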
Observations:
- Optimized Flags (186.07 ms): This very fast time indicates the optimization flags were highly effective at parallelizing the prompt evaluation (pre-fill phase). The parallel matrix multiplication used all available CPU cores/threads to load the model’s weights and process the long context quickly.
- Default (172,979.51 ms): The nearly 3-minute delay shows that the default configuration runs the long prompt evaluation almost entirely sequentially, or with minimal parallelization, demonstrating the huge memory-bandwidth cost of processing thousands of tokens on a CPU-only system.
- Warm Run Speedup: In both cases, the **cache speedup is extreme** (91.7% for the optimized run, 99.5% for the default run). This consistently proves that once the long prompt is processed and stored in the KV cache, the memory-bound pre-fill work is skipped, and subsequent requests are bottlenecked only by first-token generation.
Batching
| Metric | Optimized Flags (-t 4, etc.) | Default | Conclusion |
|---|---|---|---|
| Short Prompt TTFT (BS=1) | 1606.63 ms | 1363.26 ms | Similar (both slow, suggesting client-side Python overhead). |
| Peak Sequential TPS | 9.97 TPS (at BS=4) | 10.44 TPS (at BS=2) | Default config has slightly higher peak TPS. |
| Peak Concurrent TPS | 11.06 TPS (at BS=12) | 10.59 TPS (at BS=12) | Similar under extreme load. |
| High Concurrency TTFT | 9430.09 ms (BS=12) | 18354.88 ms (BS=12) | Default config pays nearly 2x the queuing tax. |
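The "2x queuing tax" is simply the ratio of the two BS=12 TTFTs:

```python
default_bs12_ttft_ms = 18354.88
optimized_bs12_ttft_ms = 9430.09
print(round(default_bs12_ttft_ms / optimized_bs12_ttft_ms, 2))  # 1.95
```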
Single-Stream TTFT
The BS=1 TTFT remains consistently high in both tests (~1600 ms optimized vs. ~1360 ms default).
This confirms that the extreme slowness for single, isolated requests is not primarily due to the server flags, but likely due to the Python client’s threading overhead (asyncio.to_thread) and the latency introduced by the OpenAI API server wrapper itself, which adds a significant base overhead before the fast prompt evaluation even begins.
Single-Stream Throughput (TPS)
Peak sequential TPS is ~10 TPS with the default config.
This is the most accurate measure of the CPU's maximum sustainable token generation speed (decode phase). The fact that the peak is achieved with the default config suggests that the custom flags were either slightly sub-optimal or introduced unnecessary overhead for the memory-bound decoding process.
Concurrency Queue
| Concurrent Batch Size | Optimized Flags (Avg TTFT -ms) | Default (Avg TTFT -ms) |
|---|---|---|
| BS=1 | 698.89 | 597.54 |
| BS=2 | 1408.57 | 1939.59 |
| BS=4 | 2707.35 | 7757.64 |
| BS=8 | 6594.63 | 13498.76 |
| BS=12 | 9430.09 | 18354.88 |
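One way to read this table (my own normalization, not part of the benchmark output) is TTFT per concurrent slot: the optimized server stays in a roughly 680-820 ms band per slot, while the default server climbs to about twice that from BS=4 onward:

```python
batch_sizes = [1, 2, 4, 8, 12]
optimized_ttft = [698.89, 1408.57, 2707.35, 6594.63, 9430.09]
default_ttft = [597.54, 1939.59, 7757.64, 13498.76, 18354.88]

for name, ttfts in (("optimized", optimized_ttft), ("default", default_ttft)):
    # Average TTFT contributed per concurrent request in the batch
    per_slot = [round(t / bs) for bs, t in zip(batch_sizes, ttfts)]
    print(f"{name}: per-slot TTFT (ms) = {per_slot}")
```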
The TTFT for BS=8 and BS=12 on the default server is significantly higher (13.5 s and 18.4 s, versus 6.6 s and 9.4 s) than on the optimized server, indicating a much higher queuing tax. Without the explicit thread management from the flags, the default system scheduler struggles more under high concurrent load.
Aggregate TPS in concurrent batches remains roughly constant on both the default server (9.70 - 10.59) and the optimized server (10.12 - 11.06), confirming the throughput ceiling for our CPU/memory-bound setup.
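That ceiling shows up in how tightly the default server's aggregate TPS clusters across batch sizes; using the values from the run above:

```python
default_aggregate_tps = [10.39, 10.11, 9.70, 10.22, 10.59]  # BS = 1, 2, 4, 8, 12
spread = (max(default_aggregate_tps) - min(default_aggregate_tps)) / min(default_aggregate_tps)
print(f"Aggregate TPS spread across batch sizes: {spread:.1%}")  # under 10%
```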
Compare with V1 Baseline
Here's a high-level comparison between the V1 baseline (94.52 ms TTFT / 10.57 TPS) and the new benchmarks with optimized flags and with default settings.
| Feature | Test 1: V1 Baseline (fixed short prompt, default config) | Test 2: New Script (optimized flags) | Test 3: New Script (default config) |
|---|---|---|---|
| Server Config | Default (-t 1 implicit, poor pre-fill parallelization) | Optimized (-t 4, --threads-batch 4, etc.) | Default (-t 1 implicit, poor pre-fill parallelization) |
| Prompt Used | Fixed short prompt | Short prompt (for BS=1) | Short prompt (for BS=1) |
| Baseline TTFT | 94.52 ms | 1606.63 ms | 1363.26 ms |
| Baseline TPS | 10.57 TPS | 9.12 TPS | 7.63 TPS |
| Interpretation | Best single-request performance: shows the true, efficient, single-threaded speed of the older system/client stack. This is the hardware's best decode speed, with low overhead. | High overhead, max batching: poor BS=1 TTFT due to client/threading overhead, but fastest long-prompt cold TTFT (186 ms) thanks to effective 4-core pre-fill parallelization. | Lowest long-prompt performance: BS=1 TTFT is similar to the optimized run's due to client overhead, but the long-prompt cold run failed dramatically (~173 s) due to lack of pre-fill parallelization. |
| Overall Value | True peak decode speed | Best long-context latency | Unsuitable for long context |
When we zoom into the Baseline V1 (94.52ms TTFT / 10.57 TPS) and the new test with optimized flags, we notice significantly different performance profiles based on the specific load (caching vs. contention).
| Metric | Baseline (Short Prompt) | Caching (Warm Run Avg) | Sequential (BS=1) | Concurrent (BS=8 Aggregate) |
|---|---|---|---|---|
| TTFT (ms) | 94.52 ms | 15.49 ms | 1606.63 ms | 6594.63 ms |
| Throughput (TPS) | 10.57 TPS | N/A (single token) | 10.91 TPS | 8.71 TPS |
While V1 Baseline (Fixed short prompt, Default Config) gives us the cleanest single-request speed, real-world LLM deployments are defined by two things: concurrency and context management, both of which were directly tested in the optimized script.
On the other hand, the V1 Baseline is excellent for determining the theoretical peak single-stream decode speed (10.57 TPS), but it fails to capture critical server effects:
- No Concurrency: It measures sequential speed, not the real-world performance drop under simultaneous load.
- No Optimization Test: It doesn’t test the effectiveness of caching or the parallelization needed for long prompts.
In terms of caching (baseline TTFT vs. warm run), the V1 Baseline TTFT of 94.52 ms is completely dwarfed by the new test's Avg Warm TTFT of 15.49 ms. This confirms that KV cache reuse is functioning exceptionally well. The pre-fill time is essentially eliminated, resulting in an 83.6% reduction in TTFT relative to the old baseline, or a 91.7% speedup relative to the new cold run. The 2571-token initial pre-fill takes 186.07 ms (the new cold-run TTFT), but subsequent requests using that cache take only about one-twelfth of that time, showing that the memory bandwidth cost of the pre-fill is the dominant factor in TTFT for long prompts.
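Both percentages follow directly from the measured TTFTs:

```python
v1_baseline_ttft_ms = 94.52  # V1 baseline TTFT
warm_ttft_ms = 15.49         # avg warm TTFT, optimized run
cold_ttft_ms = 186.07        # cold-run TTFT, optimized run

vs_baseline = (1 - warm_ttft_ms / v1_baseline_ttft_ms) * 100
vs_cold = (1 - warm_ttft_ms / cold_ttft_ms) * 100
print(f"{vs_baseline:.1f}% vs. V1 baseline, {vs_cold:.1f}% vs. the new cold run")
```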
Conclusion
The llama-server optimized flags were an effective performance tuning tool with these gains:
- Enables fast prompt pre-fill: The default-config run showed that without parallelization, the long prompt evaluation was practically sequential, taking about 3 minutes in the cold run. This confirms that the default setting completely fails to utilize the CPU's multi-core power for this workload. The optimized-flags run showed that by explicitly enabling 4-way parallelization, the same long context was processed in under 0.2 seconds. This massive difference (a ~99.9% reduction in time from the default configuration) is a concrete measurement of the effectiveness of parallelization on our 4-core CPU for long prompts.
- Reduces the queuing tax under concurrent load, cutting TTFT by nearly 50% compared to the default config.
The Default Config test reveals the system’s actual limits in prompt evaluation and concurrent throughput, especially for long contexts. The optimized configuration is superior for real-world server use cases involving long chat histories or high concurrency, primarily due to the massive reduction in long prompt processing time.
This shows that the system administrator’s configuration choice has a direct, dramatic, and realistic impact on client-side latency.